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Abstract 

A goal of systems biology is to understand the dynamics of intracellu- 
lar systems. Stochastic chemical kinetic models are often utilized to accu- 
rately capture the stochastic nature of these systems due to low numbers 
I— ^ of molecules. Collecting system data allows for estimation of stochastic 

chemical kinetic rate parameters. We describe a well-known, but typically 
r \ impractical data augmentation Markov chain Monte Carlo algorithm for 

estimating these parameters. The impracticality is due to the use of rejec- 
tion sampling for latent trajectories with fixed initial and final endpoints 
s-^ which can have diminutive acceptance probability. We show how graph- 

I ^ I ical processing units can be efficiently utilized for parameter estimation 

in systems that hitherto wore inestimable. For more complex systems, 
^-H we show the efficiency gain over traditional CPU computing is on the or- 

^ der of 200. Finally, we show a Bayesian analysis of a system based on 

Michaelis-Menton kinetics. 

<N 

1 Introduction 

The development of highly parallelized graphical processing units (GPUs) has 
' I largely been driven by the video game industry for faster and more accurate 

. . real-time 3D visualization. More recently the graphics card industry has intro- 

^ duced general purpose GPUs for scientific computing. Much of the focus of this 

k> work has been on building parallelized linear algebra routines since these appli- 

cations are inherently amenable to parallelization [Galoppo et al., 2005, Volkov 
and Demmel, 2008, Kriiger and Westermann, 2005]. More recently the biolog- 
ical scientific community has taken interest for applications such as leukocyte 
tracking [Boyer et al., 2009], cluster identification in flow cytometry [Suchard 
et al., 2010], and molecular dynamic simulation of intracellular processes [Li and 
Petzold, 2010]. 

The statistics community has been slower to venture into this massively 
parallel area, but in the last couple of years a few papers have appeared us- 
ing GPUs to provide efficient analyses of problems that would otherwise by 
computationally prohibitive. Topics of these papers include statistical phylo- 
genetics [Suchard and Rambaut, 2009], slice sampling [Tibbits et al., 2010], 
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high-dimcnsional optimization [Zhou ct al., 2010], simulation of Ising models 
[Preis et al., 2009], approximate Bayesian computation (ABC) [Liepe et al., 
2010], estimating multivariate mixtures [Suchard et al., 2010], population-based 
Markov chain Monte Carlo (MCMC) and sequential Monte Carlo methods [Lee 
et al., 2009]. As the number of parallel cores increase, the presence of GPU 
computing will undoubtably grow. 

The use of parallel computing in the area of stochastic chemical kinetics is 
primarily focused on simulation of systems assuming known reaction parameters 
[Li and Petzold, 2010]. Some have used these simulations within an approximate 
Bayesian computation (ABC) framework for estimation of reaction parameters 
[Liepe et al., 2010]. The Bayesian approach presented here is a special-case 
of the ABC methodology and can be used as a gold-standard for comparing 
efficacy of the more general ABC methodology. This approach implements data 
augmentation MCMC (DA-MCMC) where the latent trajectories for chemical 
species are simulated at each iteration of the MCMC algorithm Marjoram et al. 
[2003]. Coupling this algorithm with GPU computing provides Monte Carlo 
estimates of parameter posteriors for sizeable systems in reasonable time frames. 

This article proceeds as follows. In section 2, we describe stochastic chemical 
kinetic models. In section 3, wc introduce the DA-MCMC Bayesian approach to 
parameter inference in these models. Section 4 discusses modifications required 
or beneficial for using this inferential technique on CPUs. Section 5 provides a 
simulation study to determine the computational efficiency gain of using CPUs 
relative to CPUs as well as full Bayesian analysis of a Michaelis-Menton system. 
Finally, concluding remarks and future research plans are discussed in section 
6. 

2 Stochastic chemical kinetic models 

Many biological phenomenon can be modeled using stochastic chemical kinetic 
models Wilkinson [2006]. These models are particularly useful when at least 
one species has a small number of molecules and therefore deterministic models 
provide poor approximations. In biology, this is common when considering intra- 
cellular populations such as the number of DNA, RNA, and protein molecules. 
Below we introduce the notation required for understanding these stochastic 
chemical kinetic models as well as methods used to simulate from them. 

Consider a spatially homogeneous biochemical system within a fixed volume 
at constant temperature. This system contains A'' species {Si, . . . , Sn} with 
state vector X{t) = {Xi{t), . . . jX^it))' describing the number of molecules of 
each species at time t. This state vector is updated through M reactions labeled 
iZi, . . . , Rm- Reaction j e {1, . . . , M} has a propensity aj{x) = 6jhj{x) where 
9j is the unknown stochastic reaction rate parameter for reaction j and hj{x) 
is a known function of the system state x. Multiplying the propensity by an 
infinitesimal r provides the probability of reaction j occurring in the interval 
\t, t + r). If reaction j fires, the state vector is updated to X{t + T) = X{t) + vj 
where Vj = {vij, . . . jVNj)' describes the number of molecules of each species 
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that arc consumed or produced in reaction j. 

The probabiUty distribution for the state at time t, p{t, x), is the solution of 
the chemical master equation (CME): 

d ^ 

Q^P{t,x) = ^ {aj{x-Vm)p{t,x- Vm) - aj{x)p{t, x)) . (1) 

This sohition is only analytically tractable in the simplest of models. In more 
complicated models with discretely observed data, standard statistical methods 
for performing inference on the 6'jS are unavailable due to intractability of this 
probability distribution. This necessitates the use of analytical or numerical 
approximations such as approximate Bayesian computation [Marjoram et al., 
2003]. 



2.1 Stochastic simulation algorithm 

The DA-MCMC algorithm described later requires forward simulation of the 
system from a known initial state which is accomplished using the stochastic 
simulation algorithm (SSA) [Gillespie, 1977]. The basis of this algorithm is the 
next reaction density function [Gillespie, 2001]: 

p(T,J = ilart) = ^-ao(a;,)e-»o(-')- 
ao{xt) 

where ao{xi) = X^jli '^j(^t)- Since the joint distribution is the product of 
the marginal probability mass function for the reaction indicator j and the 
probability density function for the next reaction time r, the reaction indicator 
and reaction time arc independent. SSA involves sampling the reaction indicator 
J = j with probability aj{xt)/ao{xt) and the reaction time r ~ Exp{ao{xt)), 
where Exp{X) is an exponential random variable with mean 1/A. The state 
of the system is incremented according to the state change vector Vj, time is 
incremented by r, and the propensities are recalculated with the new system 
state. This process continues until the desired ending time is reached. Many 
speedups/approximations for SSA are available and the reader is referred to 
Gillespie [2007] for a review. 



3 Bayesian inference 

Bayesian inference is a methodology that describes all uncertainty through prob- 
ability. Let y denote any data observed from the system and 9 = {6i,. . . , 6 m)' 

the vector of unknown parameters. The objective of a Bayesian analysis is the 
posterior distribution that can be found using Bayes' rule 

p(%) = «#ocp(,|^)pW (2) 
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where p(y\0) is the statistical model, often referred to as the likelihood, p{9) is 
the prior distribution encoding information about the parameters available prior 
to the current experiment, and p{y) is the normalizing constant to make p{0\y) 
a valid probability distribution. The second half of equation (2) indicates that it 
is rarely necessary to determine the normalizing constant p{y) when performing 
a Bayesian inference. 



3.1 Complete observations 

In the unrealistic setting where the system is observed completely, i.e. y = X = 
X[o,T] where ^[a,b] indicates all values for X on the interval [a, 6], stochastic 
reaction rates can be inferred easily. If we assume independent gamma priors 
for each of the stochastic reaction rates, i.e. 9j Ga{aj, j3j), where the gamma 
distribution is proportional to O'^' ^ ex.p{—9j(3j), then the posterior distribution 
under complete observations are independent gamma distributions 

0j\y'!i^'Ga{aj+rj,/3j+bj) (3) 

where rj is the number of times reaction j fired and bj is the integral of hj{-) 
over interval [0,r]. Mathematically, we write 

K 



b, = / hj{Xt)dt = ^/i,(Xt,_J (tk - tk-l) 
■^0 fe=l 



'=-1 (4) 



where jk G {!,..., M} is the reaction index for the fc*^ reaction, I(x) is the 
indicator function that is 1 when x is true and otherwise, tk is the time of 
the k^^ reaction, and a total of K reactions fired in the interval [0,r]. The two 
values rj and bj are the sufficient statistics for parameter 6j and are utilized in 
Section 4.3 to increase computational efficiency. 



3.2 Discrete observations 

In the more realistic scenario of perfect but discrete observations of the sys- 
tem, the problem becomes analytically intractable and, even worse, numerical 
techniques are challenging [Wilkinson, 2006]. This challenge comes from the 
necessity to simulate paths from Xt._^ to Xt- where and ti are consecutive 
observation times [Boys et al., 2008]. These simulated paths are necessary when 
using a Gibbs sampling approach presented in the following section that alter- 
nates between 1) draws of parameters conditional on a trajectory and 2) draws 
of trajectories consistent with the data and conditional on parameters. 

We define the discrete observations as y = {X^. : i = 0, . . . ,n} and update 
Bayes' rule in equation (2) to include the unknown full latent trajectories X. 
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The desired posterior distribution is now the joint distribution for the underlying 
latent states and the unknown parameters 

n 

p{e,x\y)^i{yo = XtMO)lli{yi = Xu)p {x^t,_„t,^\e,Xt,_,) 

i=l 

n 

= p{0)l[p{X^t,.,M)\^,Xt,_, = yi_uXu = Vi) 

i=l 

where p (X(4._j^t.)|6',Xt._j = = yi) denotes the distribution for the la- 

tent state starting at and ending at over the time interval i.e. 
the distribution for a continuous-time Markov chain with fixed endpoints. Since 
this distribution is not analytic - even up to a proportionality constant - the 
distribution p{9,X\y) is not analytic. 

3.3 Markov chain Monte Carlo 

A widely used technique to overcome these intractabilities in Bayesian analysis 
is a tool called Markov chain Monte Carlo (MCMC). In particular, we utilize a 
special case of MCMC known as Gibbs sampling [Marjoram et al., 2003]. This 

iterative approach consists of two steps: 

1. draw 6'W ~p(6'|X(^-i),y) and 

2. draw ^p{X\9(*\y) 

where superscript (i) indicates that we use the draw from the i*'^ iteration of the 
MCMC, p{6\X'^''~^\y) is the distribution for the parameters based on complete 
trajectories, and p{X\6^''\ y) is the distribution for the complete trajectory based 
on the current parameters. The joint draw {O^^^ , ) defines an ergodic Markov 
chain with stationary distribution p{9,X\y). 

In order to utilize this Gibbs sampling approach, samples from the full con- 
ditional distributions for 9 and X are required, i.e. p{9\X,y) and p{X\9,y), 
respectively. Recognize that p{9\X,y) = p{9\X) since y C X due to X rep- 
resenting the entire latent trajectory at all time points as if wc had complete 
observations. Under complete observations, this full conditional distribution 
was already provided in equation (3). Therefore, samples are obtained by cal- 
culating the sufficient statistics in equation (4) based on X rather than y and 
drawing independent gamma random variables for each reaction parameter. 

3.3.1 Rejection sampling 

The full conditional distribution for X is, again, analytically intractable, but it 
is still possible to obtain samples from the distribution using rejection sampling 
[Robert and Casella, 2004, Ch. 2]. To accomplish this, we use Bayes' rule on 
the full conditional for X: 

n 

p{X\9,y) ^l[p{X^t..„t.)\^,Xt,_, = y^-i,Xt, = y^) 

i=l 
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where p{0) is subsumed in the proportionahty constant. A rejection samphng 
approach consistent with this full conditional distribution can be performed 
independently for each interval (tj-i, t,) for i = 1, . . . , n as follows 

1. forward simulate X on the interval via SSA using parameters 6 
with initial value Xt^_i = yi-i, and 

2. accept the simulation if Xt. is equal to yi otherwise return to step 1. 

As with any rejection sampling approach, the efficiency of this method is de- 
termined by the probability of forward simulation from Xt^_^ = yi-i using 
parameters 6 resulting in Xt. at time ti. For these stochastic kinetic models, 
this probability can be very low and therefore we aim to take advantage of the 
massively parallel nature of graphical processing units to forward simulate many 
trajectories in parallel until one trajectory succeeds. 

4 Adaptation to graphical processing units 

Parallel processing is clearly only advantageous if the code is parallelizable. In 

the DA-MCMC algorithm, each step of the algorithm given in Section 3.3 can 
be parallelized. The first step involves a joint draw for all stochastic reaction 
rate parameters conditional on a simulated trajectory. These parameters can 
be sampled independently as shown in equation (3). We can conceptually send 
the sufficient statistics for each reaction off to its own thread to sample a new 
rate parameter for that reaction. The second step involves rejection sampling to 
find a trajectory that satisfies interval-endpoint conditions. Both of these steps 
are candidates for parallel processing. 

The theoretical maximum speed-up for parallel processing versus serial pro- 
cessing is bounded by Amdahl's quantity 1/(1 — P + P/C) where P is the 
percentage of time spent in code that can be parallelized and C is the number 
of parallel cores available [Amdahl, 1967, Suchard et al., 2010]. The first step in 
the DA-MCMC can only be parallelized up to the minimum of M, the number 
of reactions, and C . With the chemical kinetic systems under consideration, 
M « C and therefore the gain from parallelizing this step is minimal. In 
contrast, the gain in parallelizing the rejection sampling step in DA-MCMC is 
entirely dependent on the acceptance rate. As the acceptance rate drops, P — > 1 
and maximum performance gain from parallelization is achieved. In our appli- 
cation, the acceptance rate is often very low and therefore we focus on efficiency 
gained from parallelizing the rejection sampling step. 

Consider initially the goal of sampling a trajectory from some initial state 
Xo at time given parameter to a final state xi in time 1. Using SSA, the 
probability of simulating this trajectory by starting at xq using parameter 9 
and attaining Xi has probability p, which is generally unknown. The num- 
ber of simulations required before a successful attempt is a geometric random 
variable with expectation 1/p. Therefore as the probability decreases, the ex- 
pected number of runs increases and the system becomes amenable to parallel 
processing. 



6 



A simple approach to parallclization is to have each computing thread at- 
tempt one simulation and determine whether that simulation was successful, 
i.e. the final state is xi at time 1. If multiple threads are successful, then 
one of those simulations is sampled \miformly. The sufficient statistics for that 
simulation are calculated and the DA-MCMC can continue by sampling rate 
parameters based on those statistics. In the following subsections, we discuss 
the implementation details required to turn this simple idea into a an efficient 
reality. 

4.1 Independent pseudo- random number streams 

Most current GPU-parallelized Monte Carlo algorithms know, prior to parallel 
kernel invocation, how many random numbers will be needed by each thread 
and can therefore use a "skip-ahead" technique [L'Ecuyer et al., 2002] for ob- 
taining independent streams of pseudo-random numbers [Lee et al., 2009]. This 
technique relies on one long pseudo-random number stream. The key idea is 
to have the next thread skip over the n random numbers needed by the thread 
before it. Unfortunately, in the SSA algorithm the number of random numbers 
required for each thread is random and therefore this approach becomes infea- 
sible unless we are willing to settle for n sufficiently large to ensure a minuscule 
probability of overlap in the sequences. Even then, we are inviting computa- 
tional inefficiency since the "skip-ahead' requires O(logn) operations [L'Ecuyer 
et al., 2002]. 

Instead, we use dynamic creation of pseudo-random numbers Matsumoto 
and Nishimura [2000] which has already been used in the SSA context Li and 
Petzold [2010]. This method creates a set of pseudo-random number streams 
based on the Mersenne-Twister family of generators. A hypothesis concerning 
statistical independence of these streams is given in Matsumoto and Nishimura 
[2000]: 'a set of PRNGs [pseudo-random number generators] based on linear 
recurrences is mutually "independent" if the characteristic polynomials are rel- 
atively prime to each other.' These authors state that there are many PRNG 
researchers who agree with this hypothesis. 

To balance memory constraints (discussed in Section 4.4) with execution 
speed, we implement one "independent" Mersenne-Twister (MT) per warp, 
the set of threads that receive instructions simultaneously. Current generation 
GPUs have 32 threads per warp and, again balancing memory with execution, 
we have implemented MTs that utilize a set of 40 integers for calculation of 
pseudo-random numbers. Figure 1 depicts threads within a warp accessing one 
MT. Within a warp, the first 20 threads execute simultaneously followed by the 
remaining 12 in order to ensure proper updating of the MT. To update the MT, 
thread i records the MT state integers at locations i, i+1 modulo (mod) 40, i-l-20 
mod 40 (operations specific to this 40-integer MT) Matsumoto and Nishimura 
[2000]. The thread performs bit- wise operations on these integers and records 
the output as the updated state at location i Matsumoto and Nishimura [1998]. 
After all threads have completed, they compute the pseudo-random number 
required for the SSA algorithm. For the following round of pseudo-random 
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numbers, the threads are shifted with respect to the MT state such that thread 
i is now ahgned with MT state i+32 mod 40, a process that continues ad infini- 
tum. After all threads within a warp have completed the SSA algorithm, the 
MT state plus the last used state index are written to global memory for use in 
the following kernel invocation. 
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Figure 1: A depiction of threads (black boxes, thread 1 is greyed) within a warp 
accessing Mersenne-Twister states (red boxes). A) The first 20 threads (threads 
1 and 20 depicted) accessing states i, i + 1 mod 40, and i + 20 mod 40 where i 
is the tread id. B) These threads update their corresponding Mersenne-Twister 
state (filled red boxes). C-D) Threads 21 to 32 now perform steps A and B. E) 
For the next round of pseudo-random numbers, the threads shift so that thread 
1 starts at the next Mersenne-Twister state to be updated. 



4.2 Bypass thread simulation 

Ideally once one thread is successful, current and future threads should be 
aborted. One aspect of GPU computing that varies from standard parallel 
processing is that no assumption can be made about the order in which threads 
occur. A statement such as 'if all threads with global thread id less than i 
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have failed, then perform simulation' cannot be made since it is unknown which 
threads have already made an attempt. Nonetheless, the same effect can arise 
by creating a global variable that indicates when a thread has been successful, 
but care is required. 

At any given time during the parallel execution of rejection sampling, threads 
can be placed into three categories: already-completed-and-failed, in-progress, 
and waiting-to-be-executed. Once an in-progress thread is successful, it writes 
to global memory indicating that it was successful and stores its pseudo-random 
number state. For already-completed-and-failed threads no efficiency gains are 
possible. For waiting-to-be-executed threads a simple check at the onset of simu- 
lation to the global success variable is sufficient to bypass the thread simulation 
if a thread has already been successful. Finally, in-progress threads could peri- 
odically check whether another thread has been successful, but this adds unnec- 
essary overhead and, more importantly, could bias results. Instead, in-progress 
threads are allowed to complete even if another thread has been successful in the 
meantime. In the unlikely event that another thread is successful, it is added 
to the list of successful threads and at the completion of all threads one of the 
successful threads is sampled uniformly. 

4.3 Sufficient statistics 

A naive approach to recording the SSA trajectory is to record the state and 
time when each reaction fires which requires NK + K integers/foats where 
K is the number of reactions that fired. A parsimonious way of representing a 
trajectory is to record the initial state vector as well as the times and identity of 
each reaction and recreating the trajectory when needed. The memory storage 
requirement for this approach is A'' + 2K integers/floats. In addition, K is 
random and therefore memory management is required to deal with varying 
array sizes. 

Recall that the full condition distribution for the rate parameters depends 
only on the sufficient statistics in equation (4). So rather than recording the 
entire trajectory, the sufficient statistics can be recorded which reduces the 
memory requirement down to 2M integers/floats where, often, M k. N and 
K » M. If inference on the trajectories is required, then this can be accom- 
plished after the DA-MCMC is complete by rerunning the rejection sampling 
for each (or a subset) of the MCMC iteration values for the rate parameters. 

A solution that reduces the memory requirements even further and does not 
require rerunning the rejection sampling is to record the initial PRNG state 
that resulted in a successful simulation in lieu of both the full trajectory and 
the sufficient statistics. The memory storage requirement is only 41 integers 
(40-integer MT state plus the last used state index) which clearly scales with 
N , M, and K. For calculation of sufficient statistics or any trajectory inference, 
the trajectory is re-simulated with parameters corresponding to that iteration 
in the MCMC and using the initial PRNG state that was previously successful. 
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Figure 2: The CUD A memory model. 



4.4 Efficient memory usage 

A trade-ofF made for the massively parallel nature of the GPU is the amount of 
memory available for each thread. This is an overarching concern that has been 
covered elsewhere [Lee et al., 2009, Suchard et al., 2010], but we now discuss 
how this concern can be addressed in the parallel rejection sampling framework. 
Figure 2 provides a diagram of the CUDA memory model. Importantly, any 
access to memory below the threads (in green) is relatively slow and memory 
depicted above the threads is fast, but very limited. 

Table 1 provides hardware memory constraints on the Tesla TIO GPU and 
the algorithm memory allocation described here. Constant memory can be used 
for quantities that do not change during the GPU kernel execution and has an 
8k cache for efficient access [Kirk and Hwu, 2009, Ch. 5]. Therefore it is a 
convenient location for the stoichiometry matrix and reaction rate parameters 
(which only change outside of the GPU kernel) since these are common to 
all blocks and threads. Without considering efficient sparse matrix storage, the 
number of integers needed to store both the matrix and parameters is M(A^ + 1) 
which, given the systems of interest, easily fits within the constant memory 
cache. 

After constant memory, efficient memory access is achieved by utilizing reg- 
isters and shared memory. Registers are restricted to automatic scalar variables, 
i.e. not arrays, that are unique to each thread [Kirk and Hwu, 2009, Ch. 5]. If 
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Table 1: Relevant hardware memory constraints in kilobits (integers) for the 
Tesla TIO GPU and algorithm memory allocation. 



Memory type 


Amount (integers) 


Algorithm allocation 


Registers per block 
Shared per SM 


32 X 512 
4 kb (8 X 512) 


Thread system time 

Thread loop variables 

Twister state during SSA simulation 

Thread system state f 

Thread reaction propensities f 


Local per thread 
Global 

Constant 


16 kb (4096) 
4 Gb (« 10») 

64 kb (16,384) 


Thread system state f 
Thread reaction propensities f 
Success counter 
Successful twister state 
Twister states when not in use 
Reaction rate parameters 
Stoichiometry matrix 



t If shared memory is available, thread system state and reaction propensities 
are moved from local to shared memory. 



we are using the maximum number of threads per block of 512, then we are lim- 
ited to 32 registers. Clear candidates for register storage are system time in the 
SSA simulation and simple loop variables. To determine the number of registers 
per thread compile using the nvcc compiler option — ptxas-options=-v. The 
SSA algorithm currently uses all 32 registers per thread and therefore allows us 
to use the maximum of 512 threads per block. 

Shared memory is fast-access memory that can be utilized by all threads 
within a block. In this implementation, we take advantage of the fast-access by 
storing pseudo random number generator states. Also, depending on the system 
size, we can store each thread's SSA current system state, an A''-integer array, 
and possibly even the M-float array containing the reaction propensities. The 
PRNGs take up 2560 bytes of shared memory per block leaving 13824 bytes 
left over to store the system state and/or the reaction propensities. Therefore, 
if + M < 6, then both the state and propensities can be stored, otherwise 
there is not enough memory available to store both. This limitation can be met 
by either decreasing the number of threads per block to increase the amount of 
available shared memory per thread or by using local or global memory. In our 
experience, the optimal method generally depends on the system being studied. 

Remaining variables are stored either in local or global memory depending 
on whether they are thread-specific or not. For example, two important global 
variables include the indicator of whether a thread has been successful and the 
array of successful MT states. 
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5 Simulation example 



We compare the efficiency of a GPU implementation to conventional CPU im- 
plementation using the model Michaelis-Menton model. This widely known 
model is described by the following reaction graph: 



where E isa protein enzyme, S' is a substrate that is converted into a product P, 
and ES is an intermediate species for this production. The propensities for the 
three reactions are ai{X) = 9\E ■ S, a2{X) = O2ES, and a^i^X) = O^ES where 
X = {E, S, ES, P). This system has two conservation of mass relationships: 
Eo = E + ES and So = S + ES + P. 



For comparing GPU vs CPU timing, we considered only the rejection sampling 
step in Section 3.3.1 of the DA-MCMC algorithm rather than the entire MCMC 
algorithm. This was done since the probability of rejection is highly dependent 
on the current MCMC parameter draws and to obtain accurate timing a large 
quantity of MCMC iterations is required to eliminated timing bias due to ex- 
ploration of the parameter posterior. Unfortunately, obtaining a reasonable 
quantity of MCMC iterations on the CPU is simply not feasible on a reasonable 
time scale. Therefore, we compare timing for the rejection sampling portion of 
the algorithm only, although we expect the efficiency gain for the GPU should 
be comparable if the entire MCMC timing could be analyzed. 

It is important to note that if the rejection sampling step had no rejections, 
then we would expect the CPU to perform comparable to the GPU and possibly 
even better if GPU overhead is considerable. Therefore it is of interest to study 
the efficiency gain as a function of the difficulty of the rejection sampling step. 
Figure 3 compares the efficiency of one core of a 2.66GHz Intel Xeon CPU vs one 
Tesla TIO GPU where increasing difficulty of rejection sampling is equivalent to 
increased expected draws. For high acceptance probability rejection sampling 
schemes, the efficiency gain is modest and may not be worth the trouble of con- 
verting code to GPU use. In contrast for low acceptance probability rejection 
sampling schemes, the efficiency gain is around 200, meaning the GPU version 
will perform 200 times faster than the CPU version. The efficiency gain ap- 
pears to hit an asymptote around 200, for our algorithm implementation while 
the computation time involved appears to be exponentially increasing as the ac- 
ceptance probability decreases. Therefore incredibly low acceptance probability 
rejection sampling schemes could still not be handled on a GPU, but may be 
suitable to simultaneous use of multiple GPUs. 




(5) 



5.1 GPU vs CPU 
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Figure 3: The multiplicative increase in rejection sampling efRciency (black) and 
the GPU time for one iteration (red) with 95% intervals (dashed) for one core of a 
2.66GHz Intel Xeon CPU vs a Tesla TIO GPU. The effect of expected number of 
samples required for each acceptance was studied by holding constant 02 — 0.001 
and 6*3 = 0.1 while increasing di from 0.001 to 0.00245 in the Michaelis-Menton 
system of equation 5. 



5.2 Bayesian inference 

The ultimate goal of this Bayesian analysis is performing inference in stochastic 
chemical kinetic models on both the unknown reaction rate parameters and 
the latent trajectory between data points. The Michaelis-Mention system was 
simulated with true parameters 9i — 0.001, 02 = 0.2, and 6*3 — 0.1 from time 
up to time 100. The data used are provided in Table 2 where both the ES- 
complex and product P are initialyl zero. Through the monotonic decrease in 
iS*, it is clear this system converts the substrate in the product. The enzyme 
quantity initially decreases drastically as it bonds to available substrate, but 
then as substrate is converted to product more unbound enzyme is available. 

Table 2: Measurements taken from a simulated Michaelis-Mention 



system with parameters 


Oi = 


0.001, 


02 = 


0.2, and 63 


= 0.1 






Time 


10 20 


30 


40 


50 


60 70 


80 


90 


100 


E 


120 71 76 


81 


80 


90 


90 104 


103 


109 


109 


S 


301 219 180 


150 


108 


86 


61 52 


35 


29 


22 
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A non-informative independent prior is assumed for all reaction rate param- 
eters, namely p{9) oc (^i^2^3)~^- This prior is found as the limit of the gamma 
priors when both the shape and rate parameters approach zero, but the gamma 
posterior of equation (3) is still proper with aj = Pj = OVj if each reaction 
occurs at least once. 

The DA-MCMC algorithm was run for 10,000 burn-in iterations and then 
another 40,000 iterations were used for inference. Convergence was monitored 
informally via traceplots and formally using the Gelman-Rubin diagnostic [Gel- 
man and Rubin, 1992, Brooks and Gelman, 1997]. While lack of convergence is 
detected, samples are discard as bum-in. Post burn-in, no lack of convergence 
was detected. 

Figure 4 provides posterior histograms for the reaction rate parameters based 
on the observations in Table 2. The bivariate contour plot of 0i and 62 indicate 
that the value 9i + 62 is estimable from the data, but the individual values for 
61 and 62 are hard to estimate. This identifiability issue is common in systems 
biological parameter inference where equilibrium reactions abound. 

One advantage of Bayesian analyses is trivially obtained estimates and un- 
certainties for any function of the model parameters. Figure 4 provides posterior 
histograms for both Kd = ^2/^1 and Km = [82 + d^l/di known as the disso- 
ciation constant and Michaelis constant, respectively. Although many methods 
have been developed to estimate the Michaelis constant, dating at least to the 
Lineweaver-Burk plot from 1934 Lincwcavcr and Burk [1934], few methods pro- 
vide an uncertainty on the estimate. Based on the plot in Figure 4, there is 95% 
probability that the true is in the range 246 to 343. 

Figure 5 provides point-wise credible intervals for the four Michaelis-Menton 
system species. Due to mass conservation, we see that E and ES are mirror 
opposites of each other and S and P are very close to being mirror opposites 
of each other. The scientific questions of interest that are answered by these 
trajectories include when was the 5 — )■ P conversion 90% complete? or what 
is the probability that ES crossed the 90 molecule threshold? Bayesian analyses 
can trivially answer these questions while it remains difficult for other statistical 
methods. 

6 Discussion 

We presented a Bayesian analysis of stochastic chemical kinetic models that 
utilize a data augmented MCMC algorithm where the augmentation infers la- 
tent trajectories sampled via rejection sampling. This dramatic increase in 
efficiency when utilizing a GPU will allow for analysis of vastly larger systems 
in reasonable amounts of time. The timing comparison in this manuscript only 
compared rejection sampling and therefore the results are biased slightly in favor 
of the GPU. Further work is required to explore the efficiency gain of the entire 
MCMC, but we suspect the results to be very similar to the results presented 
here and the benefit of letting the CPU algorithm run for months is marginal. 
The observations in this manuscript were discrete but perfect. Clearly this 
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is an unrealistic scenario in practical applications since we rarely obtain perfect 
observations of the underlying system. Realistic models naturally incorporate 
error in one of two ways: 1) the true value is within a threshold of that observed 
or 2) all values arc possible but values closer to that observed arc more probable. 
In the first approach, everything discussed in this manuscript is still applicable 
since forward simulations that are consistent with the observations will still 
be needed. These simulations could easily be harder to obtain and therefore 
more amenable to parallelization. In the second approach, all trajectories are 
possibilities and therefore rejection sampling is not applicable. We are exploring 
the use of an independent Metropolis-Hastings proposal and methodologies that 
exploit creation of multiple independent proposals simultaneously. 

Approximate Bayesian computation approaches have already been imple- 
mented on a GPU in a package called ABC-SysBIO [Licpc ct al., 2010]. This 
was implemented in Python utilizing the PyCUDA wrapper [Klockner et al., 
2009] to access the CUDA API. Since Liepe et al. [2010] devotes only two para- 
graphs relevant to this manuscript, it is unclear how PyCUDA implements the 
algorithm, e.g. random number generation, memory efficiency, etc., and whether 
it is competitive with the implementation discussed in this manuscript. 

The few papers discussing Bayesian inference on GPUs published to date 
have shown remarkable efficiency gains. Since this field is computation heavy, 
this increased efficiency should lead to Bayesian techniques being much more 
widely adopted than they are today as the capacity to solve highly complex 
problems in reasonable time frames increases. 
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Figure 4: Posterior histograms for stochastic reaction rate parameters as well 
as the stochastic dissociation and Michaehs constants, Kd = O2/O1 and Km = 
[O2 + ^3]/^! respectively, and a bivariate contour plot (quantiles: 2.5%, 25%, 
50%, 75%, and 95%) for the joint posterior of 9i and 62 with true values (red) 
based on the data in Table 2 and using the DA-MCMC algorithm. 
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Figure 5: Posterior 95% point-wise credible intervals (black) and three random 
draws from the posterior (colored) for the state trajectory. 
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